In [1]:
from IPython.display import display
import pandas as pd
from enrondatahandling import EnronEmailDataset
In [2]:
# Load and parse the Enron email dataset
enronData = EnronEmailDataset('./data')
In [3]:
# Let's take a look at the emails table
enronData.emails.head()
Out[3]:
In [4]:
# The recipients table is kept separate so that the emails dataframe does not hold lists as values
enronData.recipients.head()
Out[4]:
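As a rough illustration of why a separate recipients table is convenient, the sketch below builds one from list-valued columns on toy data. The column names `to`, `cc`, and `bcc` and the addresses are assumptions made up for this example; this is not necessarily how `EnronEmailDataset` builds its tables internally.

```python
import pandas as pd

# Hypothetical toy data: the 'to', 'cc' and 'bcc' columns are assumptions,
# not the actual schema used by EnronEmailDataset.
raw = pd.DataFrame(
    {
        "sender": ["a@enron.com", "b@enron.com"],
        "to": [["b@enron.com"], ["a@enron.com", "c@enron.com"]],
        "cc": [[], ["d@enron.com"]],
        "bcc": [[], []],
    },
    index=pd.Index([0, 1], name="email_id"),
)

# Combine the to, cc and bcc lists into a single list per email.
combined = [to + cc + bcc for to, cc, bcc in zip(raw["to"], raw["cc"], raw["bcc"])]

# One row per (email, recipient) pair instead of one list per email.
recipients = (
    pd.DataFrame({"email_id": raw.index, "recipient": combined})
    .explode("recipient")
    .reset_index(drop=True)
)

# The per-email recipient count can then be derived with a plain groupby.
num_recipients = recipients.groupby("email_id")["recipient"].count()
emails = raw[["sender"]].assign(num_recipients=num_recipients)

print(recipients)
print(emails)
```

With one row per recipient, joins and group-bys work on plain scalar columns, which is what the analysis below relies on.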
Let's now run some basic analysis to see what insights we can get out of this data.
Note: In both questions below, "recipients" means everyone on the to, cc, and bcc lists.
In the next couple of sections, I try to answer the following question:
Let's label an email as "direct" if there is exactly one recipient and "broadcast" if it has multiple recipients. Identify the top 3 people who received the largest number of direct emails and the person (or people) who sent the largest number of broadcast emails.
In [5]:
# Join each recipient row to the emails that have exactly one recipient
directs = pd.merge(
    enronData.recipients,
    enronData.emails[enronData.emails['num_recipients'] == 1],
    left_on='email_id',
    right_index=True)[['ts', 'recipient']]
# Count direct emails per recipient, largest first
directs = (
    directs.groupby('recipient')
    .count()
    .rename(columns={'ts': 'count_direct'})
    .sort_values(by='count_direct', ascending=False))
directs.head()
Out[5]:
In [6]:
# Keep only emails with more than one recipient
broadcasts = enronData.emails[enronData.emails['num_recipients'] > 1][['sender', 'ts']]
# Count broadcast emails per sender, largest first
broadcasts = (
    broadcasts.groupby('sender')
    .count()
    .rename(columns={'ts': 'count_broadcast'})
    .sort_values(by='count_broadcast', ascending=False))
broadcasts.head()
Out[6]:
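Since the question asks for "the person (or people)", ties matter: `head()` on a sorted table can silently cut off someone with the same count. The sketch below shows one way to keep ties, using invented counts that are not taken from the Enron data.

```python
import pandas as pd

# Hypothetical counts, invented purely for illustration.
count_broadcast = pd.Series(
    {
        "w@enron.com": 7,
        "x@enron.com": 7,
        "y@enron.com": 5,
        "z@enron.com": 5,
        "v@enron.com": 2,
    },
    name="count_broadcast",
)

# Everyone tied for the largest number of broadcast emails.
top_senders = count_broadcast[count_broadcast == count_broadcast.max()]

# "Top 3" that keeps every entry tied with the third-largest value,
# so it may return more than three rows.
top3_with_ties = count_broadcast.nlargest(3, keep="all")

print(top_senders)
print(top3_with_ties)
```

The same two idioms apply to a column of a dataframe, for example a `count_direct` or `count_broadcast` column.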
Based on the outputs above, we can say:
In this section, I try to answer the following question:
Find the five emails with the fastest response times. Please include file IDs, subject, sender, recipient, and response times. (A response is defined as a message from one of the recipients to the original sender whose subject line contains all of the words from the subject of the original email, and the response time should be measured as the difference between when the original email was sent and when the response was sent.)
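To make the definition of a response concrete, here is a minimal sketch of the matching rule on toy data. The helper names `subject_words` and `find_responses`, the toy emails, and the word-splitting rule (lowercase, split on whitespace) are all assumptions for illustration; the `responses` table in `EnronEmailDataset` may be computed differently.

```python
import pandas as pd


def subject_words(subject):
    """Split a subject into lowercase words (a simplifying assumption)."""
    return set(subject.lower().split())


def find_responses(emails, recipients):
    """Pair each email with later replies from one of its recipients."""
    # One row per (email, recipient), carrying sender, subject and timestamp.
    flat = emails.reset_index().merge(recipients, on="email_id")
    # A candidate reply goes from an original recipient back to the original sender.
    pairs = flat.merge(
        flat,
        left_on=["sender", "recipient"],
        right_on=["recipient", "sender"],
        suffixes=("", "_response"),
    )
    # The reply must be sent after the original.
    later = pairs[pairs["ts_response"] > pairs["ts"]]
    # The reply subject must contain every word of the original subject.
    mask = pd.Series(
        [
            subject_words(s) <= subject_words(r)
            for s, r in zip(later["subject"], later["subject_response"])
        ],
        index=later.index,
        dtype=bool,
    )
    out = later[mask].copy()
    out["response_time_in_secs"] = (out["ts_response"] - out["ts"]).dt.total_seconds()
    return out.sort_values("response_time_in_secs").reset_index(drop=True)


# Hypothetical toy emails, invented for this example.
emails = pd.DataFrame(
    {
        "sender": ["a@enron.com", "b@enron.com", "c@enron.com", "a@enron.com"],
        "subject": ["Budget meeting", "RE: Budget meeting", "Lunch", "Re: dinner"],
        "ts": pd.to_datetime(
            [
                "2001-05-01 09:00",
                "2001-05-01 09:05",
                "2001-05-01 09:10",
                "2001-05-01 09:20",
            ]
        ),
    },
    index=pd.Index([0, 1, 2, 3], name="email_id"),
)
recipients = pd.DataFrame(
    {
        "email_id": [0, 1, 2, 3],
        "recipient": ["b@enron.com", "a@enron.com", "a@enron.com", "c@enron.com"],
    }
)

responses = find_responses(emails, recipients)
print(responses[["email_id", "email_id_response", "response_time_in_secs"]])
```

Only email 1 qualifies as a response (to email 0): email 3 goes back to the sender of email 2, but its subject does not contain the word "lunch". Note that under this simple rule an original email with an empty subject would match any reply, so a real implementation would probably want to handle that case explicitly.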
In [7]:
responses = enronData.responses.sort_values(by='response_time_in_secs').reset_index()
responses = responses[[
    'email_id',
    'sender',
    'subject',
    'email_id_response',
    'sender_response',
    'subject_response',
    'response_time_in_secs']]
responses.head()
Out[7]:
Based on the outputs above, we can say that the five emails with the fastest response times in order are: